Streaming N:1 compaction by dian-lun-lin · Pull Request #659 · datastax/jvector

dian-lun-lin · 2026-04-14T22:15:59Z

This PR addresses #580 by adding OnDiskGraphIndexCompactor, a streaming N:1 compaction algorithm that merges multiple on-disk HNSW graph indexes into a single compacted index.

source[0].index  ─┐
source[1].index  ─┤──► OnDiskGraphIndexCompactor ──► compacted.index
source[N].index  ─┘

For a full description of the algorithm and benchmarking instructions, see docs/compaction.md and benchmarks-jmh/src/main/java/io/github/jbellis/jvector/bench/CompactorBenchmark.md.

Support

Streaming, low-memory: no full in-memory graph construction; runs under -Xmx5g even for 10M-vector, 2560-dim datasets
Deletion support: a per-source live-node FixedBitSet excludes deleted nodes from the output
Ordinal remapping: a user-provided OrdinalMapper maps each source's local ordinals to a contiguous global ordinal space; the included OffsetMapper handles the common sequential case
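
To make the deletion and remapping bullets concrete, here is a minimal, self-contained sketch of the sequential case. It is not jvector code: OffsetRemapSketch is a hypothetical stand-in for the idea behind OffsetMapper, and java.util.BitSet stands in for FixedBitSet. It only illustrates that a source's local ordinal i maps to offset + i and that nodes whose live bit is clear are skipped.

```java
import java.util.BitSet;

/** Hypothetical illustration of sequential ordinal remapping; not the jvector OrdinalMapper API. */
final class OffsetRemapSketch {
    private final int offset; // global ordinal assigned to this source's local ordinal 0
    private final int size;   // number of ordinals in this source

    OffsetRemapSketch(int offset, int size) {
        if (offset < 0 || size < 0) {
            throw new IllegalArgumentException("offset and size must be non-negative");
        }
        this.offset = offset;
        this.size = size;
    }

    /** Maps a local ordinal of this source to the global ordinal space. */
    int toGlobal(int local) {
        if (local < 0 || local >= size) {
            throw new IndexOutOfBoundsException("local ordinal " + local);
        }
        return offset + local;
    }

    /** Inverse mapping: global ordinal back to this source's local ordinal. */
    int toLocal(int global) {
        int local = global - offset;
        if (local < 0 || local >= size) {
            throw new IndexOutOfBoundsException("global ordinal " + global);
        }
        return local;
    }

    /** Global ordinals of the nodes whose live bit is set; cleared bits (deleted nodes) are skipped. */
    int[] liveGlobalOrdinals(BitSet live) {
        return live.stream().filter(i -> i < size).map(this::toGlobal).toArray();
    }

    public static void main(String[] args) {
        // Two sources of sizes 3 and 2; node 1 of the first source is marked deleted.
        OffsetRemapSketch first = new OffsetRemapSketch(0, 3);
        OffsetRemapSketch second = new OffsetRemapSketch(3, 2);
        BitSet liveFirst = new BitSet(3);
        liveFirst.set(0);
        liveFirst.set(2);
        BitSet liveSecond = new BitSet(2);
        liveSecond.set(0, 2);
        System.out.println(java.util.Arrays.toString(first.liveGlobalOrdinals(liveFirst)));
        System.out.println(java.util.Arrays.toString(second.liveGlobalOrdinals(liveSecond)));
    }
}
```

How the actual compactor treats the ordinals of deleted nodes is not described in this PR text, so the sketch makes no claim about that.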

Usage

List<OnDiskGraphIndex> sources = List.of(index0, index1, index2);

// Mark all nodes live (no deletions)
List<FixedBitSet> liveNodes = sources.stream()
    .map(s -> { var bs = new FixedBitSet(s.size()); bs.set(0, s.size()); return bs; })
    .collect(toList());

// Sequential ordinal remapping: source[s] node i → global offset[s] + i
int offset = 0;
List<OrdinalMapper> remappers = new ArrayList<>();
for (var src : sources) {
    remappers.add(new OrdinalMapper.OffsetMapper(offset, src.size()));
    offset += src.size();
}

var compactor = new OnDiskGraphIndexCompactor(
    sources, liveNodes, remappers,
    VectorSimilarityFunction.COSINE,
    /* executor= */ null                  // null = create internal ForkJoinPool
);

compactor.compact(Path.of("compacted.index"));

Key Changes

OnDiskGraphIndexCompactor — core compaction algorithm with parallel ForkJoinPool execution and backpressure windowing
PQRetrainer — balanced proportional sampling + sequential sorted reads for efficient codebook retraining
Minor API visibility changes to GraphSearcher, GraphIndexBuilder, and PQVectors required by the compactor
CompactorBenchmark — JMH benchmark with PARTITION_AND_COMPACT, PARTITION_ONLY, COMPACT_ONLY, and BUILD_FROM_SCRATCH modes
Unit tests covering basic compaction, deletions, ordinal remapping, and FusedPQ scenarios
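
The PQRetrainer entry mentions proportional sampling and sorted sequential reads. One plausible reading, sketched below under stated assumptions, is to split a training-sample budget across sources in proportion to their sizes, pick distinct ordinals within each source, and sort them so the vectors can be read in ascending file order. The class and method names are hypothetical and this is not the PR's implementation.

```java
import java.util.Arrays;
import java.util.Random;

/** Hypothetical sketch of proportional sampling with sorted reads; not the PR's PQRetrainer. */
final class ProportionalSamplingSketch {

    /** Splits the budget across sources in proportion to their sizes, never exceeding a source's size. */
    static int[] allocate(int[] sizes, int budget) {
        long total = 0;
        for (int s : sizes) {
            total += s;
        }
        int[] quota = new int[sizes.length];
        if (total == 0 || budget <= 0) {
            return quota;
        }
        int capped = (int) Math.min(budget, total);
        int assigned = 0;
        for (int i = 0; i < sizes.length; i++) {
            quota[i] = (int) ((long) capped * sizes[i] / total); // floor of the proportional share
            assigned += quota[i];
        }
        // Hand out what the floors left over, one sample at a time, to sources that still have room.
        for (int i = 0; assigned < capped; i = (i + 1) % sizes.length) {
            if (quota[i] < sizes[i]) {
                quota[i]++;
                assigned++;
            }
        }
        return quota;
    }

    /** Picks count distinct ordinals from [0, size) and returns them in ascending order. */
    static int[] sampleSorted(int size, int count, Random rng) {
        if (count < 0 || count > size) {
            throw new IllegalArgumentException("count must be in [0, size]");
        }
        int[] ids = new int[size];
        for (int i = 0; i < size; i++) {
            ids[i] = i;
        }
        // Partial Fisher-Yates shuffle: the first count slots end up as a uniform sample.
        for (int i = 0; i < count; i++) {
            int j = i + rng.nextInt(size - i);
            int tmp = ids[i];
            ids[i] = ids[j];
            ids[j] = tmp;
        }
        int[] picked = Arrays.copyOf(ids, count);
        Arrays.sort(picked); // ascending ordinals give monotonically increasing read positions
        return picked;
    }

    public static void main(String[] args) {
        int[] sizes = {100, 300, 600};
        int[] quota = allocate(sizes, 10);
        Random rng = new Random(42);
        for (int s = 0; s < sizes.length; s++) {
            System.out.println("source " + s + " -> " + Arrays.toString(sampleSorted(sizes[s], quota[s], rng)));
        }
    }
}
```

The sketch allocates an O(size) array per source for simplicity; a streaming implementation would likely avoid that.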

Recall

Comparison against build-from-scratch (results averaged over three runs).

Build from scratch: build with PQ, search using FusedPQ with FP reranking.
Compaction: build source partitions with PQ, compact using FusedPQ with FP rescoring, search using FusedPQ with FP reranking. Source partitions are based on a Fibonacci distribution with 4 partitions.
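
The description does not say which Fibonacci terms define the partition sizes. As a sketch only, assuming the weights are the first four terms 1, 1, 2, 3, the split of a dataset into source partitions could be computed as follows; FibonacciPartitionSketch is a hypothetical name and the benchmark's actual partitioner may differ.

```java
/** Hypothetical sketch: partition sizes proportional to Fibonacci weights (1, 1, 2, 3, ...). */
final class FibonacciPartitionSketch {

    static int[] partitionSizes(int total, int parts) {
        if (total < 0 || parts < 1) {
            throw new IllegalArgumentException("total must be >= 0 and parts >= 1");
        }
        long[] weights = new long[parts];
        long weightSum = 0;
        for (int i = 0; i < parts; i++) {
            weights[i] = (i < 2) ? 1 : weights[i - 1] + weights[i - 2];
            weightSum += weights[i];
        }
        int[] sizes = new int[parts];
        int assigned = 0;
        for (int i = 0; i < parts; i++) {
            sizes[i] = (int) ((long) total * weights[i] / weightSum); // floor of the weighted share
            assigned += sizes[i];
        }
        sizes[parts - 1] += total - assigned; // rounding remainder goes to the largest partition
        return sizes;
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(partitionSizes(700, 4)));
    }
}
```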

Dataset	Dim	Build from Scratch	Compaction	Delta
cap-6M	768	0.626	0.619	-0.008
cap-1M	768	0.656	0.656	0.000
gecko-100k	768	0.690	0.701	+0.011
e5-small-v2-100k	384	0.572	0.586	+0.014
ada002-1M	1536	0.687	0.703	+0.016
e5-base-v2-100k	768	0.676	0.692	+0.016
cohere-english-v3-10M	1024	0.544	0.561	+0.017
e5-large-v2-100k	1024	0.686	0.703	+0.017
ada002-100k	1536	0.751	0.769	+0.018
cohere-english-v3-1M	1024	0.593	0.612	+0.019

Recall is generally comparable to build-from-scratch and often better, though some datasets show small drops. All datasets compact successfully under -Xmx5g; compaction has also been validated on a 2560-dim 10M-vector dataset under the same constraint.
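
For readers unfamiliar with the metric, recall is commonly computed as the fraction of the true top-k neighbors that appear in the returned top-k. The sketch below shows that standard definition and the delta between two runs; the value of k and the exact measurement code used for the table above are not stated in this PR, and RecallSketch is a hypothetical name.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch of recall@k for one query; not the benchmark's measurement code. */
final class RecallSketch {

    static double recallAtK(int[] groundTruth, int[] retrieved, int k) {
        Set<Integer> truth = new HashSet<>();
        for (int i = 0; i < Math.min(k, groundTruth.length); i++) {
            truth.add(groundTruth[i]);
        }
        int expected = truth.size();
        if (expected == 0) {
            return 0.0;
        }
        int hits = 0;
        for (int i = 0; i < Math.min(k, retrieved.length); i++) {
            // remove() returns true only the first time an id matches, so duplicates are not double-counted
            if (truth.remove(retrieved[i])) {
                hits++;
            }
        }
        return (double) hits / expected;
    }

    public static void main(String[] args) {
        double scratch = recallAtK(new int[]{1, 2, 3, 4}, new int[]{4, 9, 1, 7}, 4);
        double compacted = recallAtK(new int[]{1, 2, 3, 4}, new int[]{4, 2, 1, 7}, 4);
        System.out.println("delta = " + (compacted - scratch));
    }
}
```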

Introduce OnDiskGraphIndexCompactor and PQRetrainer for streaming N:1 merging of on-disk HNSW indexes without full in-memory materialization. Supports deletion filtering via live-node bitsets, custom ordinal mapping, and PQ codebook retraining.

Add JFR recording, system stats collection, JSONL logging, git info capture, thread allocation tracking, dataset partitioning, and cloud storage layout utilities used by CompactorBenchmark. Switch jvector-examples logging from logback to log4j2 for consistency with benchmarks-jmh and to avoid duplicate SLF4J bindings in the fat jar.

JMH-based benchmark with configurable workload modes (PARTITION_AND_COMPACT, PARTITION_ONLY, COMPACT_ONLY, BUILD_FROM_SCRATCH), recall measurement, JFR recording, and JSONL result logging. Includes BenchmarkParamCounter for progress tracking, EventLogAnalyzer for post-run analysis, GHA workflow, and exec-maven-plugin integration. Add forced vectorization provider property to VectorizationProvider for benchmark reproducibility.

github-actions · 2026-04-14T22:16:11Z

Before you submit for review:

Does your PR follow guidelines from CONTRIBUTIONS.md?
Did you summarize what this PR does clearly and concisely?
Did you include performance data for changes which may be performance impacting?
Did you include useful docs for any user-facing changes or features?
Did you include useful javadocs for developer oriented changes, explaining new concepts or key changes?
Did you trigger and review regression testing results against the base branch via Run Bench Main?
Did you adhere to the code formatting guidelines (TBD)?
Did you group your changes for easy review, providing meaningful descriptions for each commit?
Did you ensure that all files contain the correct copyright header?

If you did not complete any of these, then please explain below.

jshook · 2026-04-16T21:58:55Z

+- Compaction: build source partitions with PQ; compact using FusedPQ with FP rescoring; search using FusedPQ with FP reranking.
+
+| Dataset              | Dim  | Build from Scratch | Compaction |  Delta |
+|----------------------|-----:|-------------------:|-----------:|-------:|


Do we need to modify this section for brevity?

dian-lun-lin · 2026-04-17

I revised it to the following:

Recall comparison (results averaged over three runs):

Build from scratch: build one index over the full dataset with PQ scoring; search using FusedPQ with FP reranking.

Compaction: partition the dataset into 4 source indexes (Fibonacci distribution), build each with PQ scoring, then compact into one index; search using FusedPQ with FP reranking.

jshook · 2026-04-16T22:00:05Z

+     * Handles writing the compacted graph index to disk, managing header, node records,
+     * upper layers, and footer in the on-disk format.
+     */
+    private static final class CompactWriter implements AutoCloseable {


Maybe break this out. The parent file is already very large.

dian-lun-lin · 2026-04-17

Extracted CompactWriter into its own top-level file.

jshook · 2026-04-16T22:04:01Z

+public final class SystemStatsCollector {
+    private static final Logger log = LoggerFactory.getLogger(SystemStatsCollector.class);
+
+    private static final String SCRIPT = String.join("\n",


Perhaps this logic shouldn't be broken out as it is. Instead of invoking a shell wrapper, it should probably be direct reads and pattern matching in Java.

dian-lun-lin · 2026-04-17

Thanks for the comment. The bash ProcessBuilder has been replaced with a ScheduledExecutorService that reads /proc/cpuinfo, /proc/meminfo, /proc/loadavg, and /proc/diskstats directly via java.nio.file.Files. The JSONL output format is unchanged.
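
For illustration, the approach described in this reply might look roughly like the sketch below: a ScheduledExecutorService periodically reads /proc files with java.nio.file.Files and prints one JSON line per sample. This is not the PR's SystemStatsCollector; the class name, the choice of fields, and the JSON key names are all assumptions, and it only runs meaningfully on Linux.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Locale;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch of polling /proc directly from Java; field and key names are assumptions. */
final class ProcStatsSketch {

    /** Parses the 1-minute load average, the first field of /proc/loadavg. */
    static double parseLoad1(String loadavg) {
        return Double.parseDouble(loadavg.trim().split("\\s+")[0]);
    }

    /** Returns the numeric value of a /proc/meminfo key such as "MemAvailable", or -1 if absent. */
    static long parseMeminfoKb(List<String> lines, String key) {
        String prefix = key + ":";
        for (String line : lines) {
            if (line.startsWith(prefix)) {
                return Long.parseLong(line.substring(prefix.length()).trim().split("\\s+")[0]);
            }
        }
        return -1;
    }

    /** Reads the /proc files once and formats a single JSON line. */
    static String sampleJson() throws IOException {
        double load1 = parseLoad1(Files.readString(Path.of("/proc/loadavg")));
        long availKb = parseMeminfoKb(Files.readAllLines(Path.of("/proc/meminfo")), "MemAvailable");
        return String.format(Locale.ROOT, "{\"ts\":%d,\"load1\":%.2f,\"memAvailableKb\":%d}",
                System.currentTimeMillis(), load1, availKb);
    }

    public static void main(String[] args) throws InterruptedException {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(() -> {
            try {
                System.out.println(sampleJson());
            } catch (IOException | RuntimeException e) {
                // Catch here so one failed sample does not cancel the periodic task.
                System.err.println("sample failed: " + e);
            }
        }, 0, 1, TimeUnit.SECONDS);
        Thread.sleep(3000);
        scheduler.shutdownNow();
    }
}
```

Keeping the parsing in pure functions over strings makes it testable without a /proc filesystem.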

dian-lun-lin added 5 commits April 14, 2026 15:06

Add compaction unit tests

52e7217

Tests for OnDiskGraphIndexCompactor covering basic compaction, deletions, ordinal remapping, multi-source merging, and FusedPQ compaction scenarios.

Update build config and project metadata for compaction

c75256a

Add result file patterns to .gitignore, update rat-excludes for the new compaction workflow and catalog cache files.

dian-lun-lin requested review from MarkWolters, jshook and tlwillke as code owners April 14, 2026 22:16

dian-lun-lin added 2 commits April 14, 2026 22:52

Fix JMH jar selection in run-compaction.yml

415f907

The benchmarks-jmh-*.jar glob matched the -javadoc jar first, which has no Main-Class. Select the shaded JMH jar explicitly by excluding -javadoc and -sources jars.

Fix CompactorBenchmark invocation in run-compaction.yml

224a709

Use -cp with CompactorBenchmark.main() instead of -jar with JMH Main to avoid BenchmarkList discovery issues in CI's shaded jar.

jshook reviewed Apr 16, 2026

Address PR review feedback

191a40d

- Extract CompactWriter into its own file to reduce OnDiskGraphIndexCompactor size
- Rewrite SystemStatsCollector to read /proc files directly in Java instead of spawning bash
- Clarify recall section description in docs/compaction.md

dian-lun-lin wants to merge 8 commits into main from compaction-pr
